I was just curious to see which areas in Los Angeles use water the most. So I clicked over to Los Angeles Open Data, and got myself Water & Electricity Usage (2005 - 2013) data.
Let’s take a look.
First, note that the water is measured in Hundred Cubic Feet (HCF). 1 HCF = approx. 748 gallons
library(zoo)
library(plyr)
library(dplyr)
library(tidyr)
library(stringr)
library(zipcode)
library(ggplot2)
library(maps)
library(ggmap)
# load the data that was already downloaded
waterPower <- read.csv("data/Water_and_Electric_Usage_from_2005_-_2013.csv")
# Some clean up
# convert the date to Date format
waterPower$Date <- as.Date(as.yearmon(waterPower$Text.Date, "%b_%Y"))
# Split the zip to zip & geo codes
zipSplit <- ldply(str_split(waterPower$Zip.Code, "\n"))
geoSplit <- ldply(str_split(zipSplit$V2, ","))
geoSplit$V1 <- str_replace_all(geoSplit$V1, "\\(", "")
geoSplit$V1 <- str_replace_all(geoSplit$V1, " ", "")
geoSplit$V2 <- str_replace_all(geoSplit$V2, "\\)", "")
geoSplit$V2 <- str_replace_all(geoSplit$V2, " ", "")
waterPower$ZipCode <- zipSplit$V1
waterPower$Lon <- geoSplit$V1
waterPower$Lat <- geoSplit$V2
# Remove the old variables
waterPower$Zip.Code <- NULL
waterPower$Text.Date <- NULL
waterPower$Value.Date <- NULL
# Reorder the columns just for neatness sake
waterPower <- waterPower[, c(3, 4, 5, 6, 1, 2)]
# Clean up the temporary objects
rm(zipSplit)
rm(geoSplit)
Some years were missing some data, so I just decided to use only the 2011 data as it has the most complete data. I was also looking to add the population data from 2010 Census, so it works outs nicely since it’s not too far from the year 2010.
# Get just the 2011 data; Get total by zipcode
waterPower2011 <- waterPower %>%
filter(Date >= as.Date('2011-01-01')) %>%
filter(Date < as.Date('2012-01-01')) %>%
group_by(ZipCode) %>%
summarise(Water.Use = sum(Water.Use), Power.Use = sum(Power.Use)) %>%
arrange(desc(Water.Use))
# Load Population info
# Merge with Population table to get population
# The result is the data for just the Los Angeles
population <- read.csv("data/2010+Census+Population+By+Zipcode+(ZCTA).csv")
colnames(population) <- c("ZipCode", "Population")
population$ZipCode <- as.character(population$ZipCode)
waterPower2011 <- waterPower2011 %>%
inner_join(population, by="ZipCode")
# Then add Latitude, Longitude for each zip code
data(zipcode)
waterPower2011$Lon <- sapply(waterPower2011$ZipCode, function(x) zipcode$longitude[zipcode$zip == x])
waterPower2011$Lat <- sapply(waterPower2011$ZipCode, function(x) zipcode$latitude[zipcode$zip == x])
# Show map of Los Angeles
losAngeles <- get_map(location="Los Angeles", zoom=11)
# Water User Per Month
ggmap(losAngeles) + geom_point(data=waterPower2011, aes(x=Lon, y=Lat, size=Water.Use, alpha=0.5)) + scale_size(range=c(3, 15))
## Warning in loop_apply(n, do.ply): Removed 45 rows containing missing values
## (geom_point).
# Get top 15 Zip Codes
# waterPower2011 is already sorted by water.use in descening order, so just need to get top 15
waterTop15 <- head(waterPower2011, 15)
waterTop15$Water.Use.In.Gallons <- waterTop15$Water.Use * 748
waterTop15[, c("ZipCode", "Water.Use", "Water.Use.In.Gallons")]
## Source: local data frame [15 x 3]
##
## ZipCode Water.Use Water.Use.In.Gallons
## 1 90071 7782.35 5821197.8
## 2 91350 7627.80 5705594.4
## 3 90265 4611.75 3449589.0
## 4 90067 4218.42 3155378.2
## 5 90211 3794.00 2837912.0
## 6 91201 2486.00 1859528.0
## 7 90017 1845.90 1380733.2
## 8 90010 1456.67 1089589.2
## 9 90012 1196.28 894817.4
## 10 90057 1084.56 811250.9
## 11 90094 1073.38 802888.2
## 12 90502 996.87 745658.8
## 13 90020 980.17 733167.2
## 14 90014 979.05 732329.4
## 15 90744 854.92 639480.2
# Water Use Per Person (based on the corresponding 2010 census data)
waterPower2011$Water.Use.Per.Person <- signif(waterPower2011$Water.Use / waterPower2011$Population, 4)
# Water Use Per Person Per Month in gallons
waterPower2011$Water.Use.Per.Month.Per.Person.In.Gallons <- signif(((waterPower2011$Water.Use / 12) / waterPower2011$Population) * 748, 4)
waterTop15 <- waterPower2011 %>%
select(ZipCode, Population, Water.Use.Per.Month.Per.Person.In.Gallons) %>%
arrange(desc(Water.Use.Per.Month.Per.Person.In.Gallons))
head(waterTop15, 15)
## Source: local data frame [15 x 3]
##
## ZipCode Population Water.Use.Per.Month.Per.Person.In.Gallons
## 1 90071 15 32340.000
## 2 90067 2424 108.500
## 3 90211 8434 28.040
## 4 91210 328 26.040
## 5 90010 3800 23.890
## 6 90265 18116 15.870
## 7 91350 33348 14.260
## 8 90094 5464 12.250
## 9 90014 7005 8.712
## 10 90021 3951 8.489
## 11 91201 22781 6.802
## 12 90058 3223 6.117
## 13 90017 23768 4.841
## 14 90013 11772 3.494
## 15 90502 18010 3.450
There’s gotta be something wrong with this. The whole thing!
First. For the zip code 90071, the population is only 15 and SO MUCH WATER used!! Where is this??
With the help of some extra data from Los Angeles County GIS Data Portal, I was able to mark the area each zip code covers. Let’s take a look.
library(rgdal)
library(gpclib)
library(maptools)
LAZIP <- readOGR("data/LA_Zip_Shapefiles_Street", "CAMS_ZIPCODE_STREET_SPECIFIC")
zipAreas <- spTransform(LAZIP, CRS("+proj=longlat +datum=WGS84"))
gpclibPermit() # Set gpclibPermitStatus() to TRUE
# Include the zip code data when building the data frame
zipData <- fortify(model=zipAreas, region="Name")
# Now all ready to draw the map with the zip code boundary.
thisZip <- "90071"
# Subset the specified zip codes
thisZipData <- subset(zipData, id == thisZip)
ggmap(get_map(location=thisZip, zoom=15, maptype="roadmap")) +
geom_polygon(aes(x=long, y=lat, group=group), fill='grey', size=.4, color='black', data=thisZipData, alpha=0.5) +
labs(title=thisZip)
It’s part of downtown Los Angeles. It’s mostly commercial. There are businesses like Bank of America, The Westin Bonaventure Hotel, UnionBank, Chase Bank, Citibank, The Standard Hotel, Hilton Hotel, etc. The fact that there are several big hotels here may explain the water usage, however, that’s still a lot of water!
Now let’s see the area where 23,768 people live and only use 4.841 gallons of water per person per month.
thisZip <- "90017"
# Subset the specified zip codes
thisZipData <- subset(zipData, id == thisZip)
ggmap(get_map(location=thisZip, zoom=15, maptype="roadmap")) +
geom_polygon(aes(x=long, y=lat, group=group), fill='grey', size=.4, color='black', data=thisZipData, alpha=0.5) +
labs(title=thisZip)
Funny how it’s right next to 90071. It covers area that’s a lot bigger and it includes lots of residential areas with big apartment complexes. But how is that they only use 4.84 gallons of water per person PER MONTH?? That’s basically about five flushes of your toilet! So either the population is incorrect or the water usage number is incorrect. Or could it be that vast part of the area is served by a water company other than LA Department of Water and Power? I did a quick web search on that and it seems unlikely.
Since we’re on the topic of water usage PER PERSON, let’s take some guesses to see what would be realistic.
# Let me make some real rough guesses
# Toilet Flush: 1 in the morning; 1 at night; 1 flush = 1 gallon;
flushes <- 2 * 1
# Shower: 1 shower per day; 10 minute shower; 1.5 gallons per minute for typical shower heads;
shower <- 1 * (10 * 1.5)
# Brush teeth; Wash hands; Let's just say 1 gallon per day;
handwash <- 1
# Prepare food; Do dishes: Kitchen faucet = 2 gallons per minute; On and off usage totaling about 15 minutes;
dishwash <- 2 * 15
# Laundry: 30 gallons per load; About 3 loads per month;
laundry <- (30 * 3) / 30 # Divided by 30 to get daily average;
# Misc: 0.5 gallons per day; Whatever else there is like watering plants, etc.
misc <- 0.5
# Let's not include anything outdoor like lawns, washing driveway, etc. for simplicity.
# 30 days in a month
(flushes + shower + handwash + dishwash + laundry + misc ) * 30
## [1] 1545
That’s 1,545 gallons per person! Obviously this is a really rough guess, but probably not too far off. So it’s safe to say per person water usage should be at least about 1,000 gallons per month.
Let’s try removing that one outlier and see how the PER person usage is throughout the city of Los Angeles:
# Let's remove that one outlier to get a better sense of the overall usage
waterPower2011 <- subset(waterPower2011, ZipCode != "90071")
ggmap(losAngeles) + geom_point(data=waterPower2011, aes(x=Lon, y=Lat, size=Water.Use.Per.Month.Per.Person.In.Gallons, alpha=0.5)) + scale_size(range=c(3,15))
So the per person usages is about the same mostly throughout the city. But then again, it’s not too helpful knowing that the data might be incorrect one way or another. The population data could be wrong as the zip code approximately assigned to whatever that could be made out from the parcel data. It could also be possible that LADWP only released parts of measurements.
Regardless, this was done just for fun and not for some serious research. I dealt with the missing values by just removing them instead of imputing as it wouldn’t have matterd much either way for this particular analysis.